A Pragmatic Chinese Word Segmentation System
نویسندگان
چکیده
This paper presents our work for participation in the Third International Chinese Word Segmentation Bakeoff. We apply several processing approaches according to the corresponding sub-tasks, which are exhibited in real natural language. In our system, Trigram model with smoothing algorithm is the core module in word segmentation, and Maximum Entropy model is the basic model in Named Entity Recognition task. The experiment indicates that this system achieves Fmeasure 96.8% in MSRA open test in the third SIGHAN-2006 bakeoff.
منابع مشابه
Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach
This paper presents a pragmatic approach to Chinese word segmentation. It differentiates from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in rea...
متن کاملA Pragmatic Chinese Word Segmentation Approach Based on Mixing Models
A pragmatic Chinese word segmentation approach is presented in this paper based on mixing language models. Chinese word segmentation is composed of several hard sub-tasks, which usually encounter different difficulties. The authors apply the corresponding language model to solve each special sub-task, so as to take advantage of each model. First, a class-based trigram is adopted in basic word s...
متن کاملEffective Document-Level Features for Chinese Patent Word Segmentation
A patent is a property right for an invention granted by the government to the inventor. Patents often have a high concentration of scientific and technical terms that are rare in everyday language. However, some scientific and technical terms usually appear with high frequency only in one specific patent. In this paper, we propose a pragmatic approach to Chinese word segmentation on patents wh...
متن کاملTowards Accurate Word Segmentation for Chinese Patents
A patent is a property right for an invention granted by the government to the inventor. An invention is a solution to a specific technological problem. So patents often have a high concentration of scientific and technical terms that are rare in everyday language. The Chinese word segmentation model trained on currently available everyday language data sets performs poorly because it cannot ef...
متن کاملA Two-stage Statistical Word Segmentation System for Chinese
In this paper we present a two-stage statistical word segmentation system for Chinese based on word bigram and wordformation models. This system was evaluated on Peking University corpora at the First International Chinese Word Segmentation Bakeoff. We also give results and discussions on this evaluation.
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2006